Comparative assessment of synthetic time series generation approaches in healthcare: leveraging patient metadata for accurate data synthesis

Background Synthetic data is an emerging approach for addressing legal and regulatory concerns in biomedical research that deals with personal and clinical data, whether as a single tool or through its combination with other privacy enhancing technologies. Generating uncompromised synthetic data could significantly benefit external researchers performing secondary analyses by providing unlimited access to information while fulfilling pertinent regulations. However, the original data to be synthesized (e.g., data acquired in Living Labs) may consist of subjects’ metadata (static) and a longitudinal component (set of time-dependent measurements), making it challenging to produce coherent synthetic counterparts. Methods Three synthetic time series generation approaches were defined and compared in this work: only generating the metadata and coupling it with the real time series from the original data (A1), generating both metadata and time series separately to join them afterwards (A2), and jointly generating both metadata and time series (A3). The comparative assessment of the three approaches was carried out using two different synthetic data generation models: the Wasserstein GAN with Gradient Penalty (WGAN-GP) and the DöppelGANger (DGAN). The experiments were performed with three different healthcare-related longitudinal datasets: Treadmill Maximal Effort Test (TMET) measurements from the University of Malaga (1), a hypotension subset derived from the MIMIC-III v1.4 database (2), and a lifelogging dataset named PMData (3). Results Three pivotal dimensions were assessed on the generated synthetic data: resemblance to the original data (1), utility (2), and privacy level (3). The optimal approach fluctuates based on the assessed dimension and metric. Conclusion The initial characteristics of the datasets to be synthesized play a crucial role in determining the best approach. Coupling synthetic metadata with real time series (A1), as well as jointly generating synthetic time series and metadata (A3), are both competitive methods, while separately generating time series and metadata (A2) appears to perform more poorly overall.


Background
Amidst the information era in which the digitalization of healthcare applications is being streamlined, privacy is evolving towards an increasingly precious asset.Data protection regulations, such as the GDPR (General Data Protection Regulation) in Europe or the HIPAA (Health Insurance Portability and Accountability Act) in the USA, regulate the processing of Personally Identifiable Information (PII) by, for example, defining the rights of the data subjects, the responsibilities of data processing agents, or various codes of conduct [1].However, attacks against theoretically secure private datasets are becoming increasingly sophisticated and have managed to break security barriers down [2,3], which obliges data collectors, controllers, and processors, to implement harsher privacy safeguards.
A relatively novel and mainstream technology to help data leakage risks is Synthetic Data Generation (SDG).This technology consists of creating artificial versions of real data (RD) using generative models, statistical models or models based on expert knowledge that can learn underlying variable distributions and multivariate correlations [4].Once an SDG model is trained, as many data records as wanted can be synthesized, following the original patterns from the real dataset.However, even if SDG techniques offer certain privacy guarantees, current literature on the topic is evolving towards determining whether the generated synthetic data (SD) is still considered personal data or not [5].
Among the different sectors in which SDG can be used, healthcare is critical in terms of sensitive information, and thus one of the most compromised information sources that needs to be controlled and monitored [6].In this context, it is common the data to have two components, the first one being classified as subjects' metadata (the static component, which does not change in time) and the second one being the time series or longitudinal part (the dynamic component, measured across a time axis).Up to now, clinical SDG has mostly been focused on generating private and useful tabular static data, while the longitudinal component, and the joint generation of both, have fallen behind.Considering this scenario, it is worth noting the need to ensure the privacy of both mentioned components [7].The combination of metadata with time series could potentially help re-identification and compromised data breaches, but synthesizing time series brings inherent challenges with it, as temporal correlations must be learnt by data generators apart from the usual intervariable correlations.
Regarding generative models, the first Generative Adversarial Network (GAN) architecture was presented by Goodfellow et al. in 2014 [8].Since then, several modified versions have been published to improve its performance for specific tasks in a healthcare context [9].As for synthesizing time series, TimeGAN [10] was the first specific variant that paved the way for newer networks that can also handle the temporality of the data [11].However, how time series that are linked to metadata are synthesized remains a field for further research, as current models are solely focused on generating time series, or they generate them jointly with the associated metadata, still having a gap on the comparison analysis of these latter approaches, which is the primary focus of the current work.
In 2022, we proposed two different approaches for the Synthetic Time Series Generation (STSG) task [12], where different combinations of real and synthetic metadata and time series are merged to find the best approach for achieving useful time series.In the present work, a more extensive evaluation of both approaches is carried out assessing the resemblance of the SD with respect to the RD (1), how useful the SD is to build downstream applications (2), and how private the generated information is (3).The alignment and balance of these components is still an open problem in the current literature, as there exists a complex relationship between the three.A strong privacy guarantee is related to a higher bias of the SD with respect to the RD, which makes the utility and the resemblance of the former to decrease.In contrast, maintaining a higher data utility results in compromised SD, since personal information could potentially be extracted from it due to the extensive resemblance between SD and RD.Furthermore, the relevance of each component depends on the target application, as there are critical examples that need to address the privacy issue and sacrifice the resemblance and utility of the SD, but it is not always the case.Moreover, the models we evaluated in Isasa et al. [13] were included as a third STSG approach, where the metadata and the time series are jointly generated using a single model.Drawing from our prior studies, this is, to our understanding, the first work that evaluates three distinct variants to generate SD that combines time series and related metadata.A common evaluation procedure is employed to assess the resemblance, utility and privacy of the results, thereby making a significant contribution to the field.
The remainder of this document is organised as follows: the "Methods" section provides a detailed description of how the SD was generated using multiple model and dataset combinations, as well as how the data was coupled and evaluated after the generation step.Next, the results of the assessment are presented, followed by a discussion, as well as a list of limitations that were found across this research.Finally, the overall conclusion of the comparison is presented.

Methods
This section first describes the datasets utilized for this work and the preprocessing steps needed to prepare the data.Then, three different STSG approaches are presented, followed by the used SDG models.Finally, the evaluation of the generated SD is explained.

Data sources
To ensure outcome consistency devoid of data reliance, the experiments were conducted using two different datasets sourced from Physionet [14] and a third lifelogging dataset named PMData.
First, a dataset consisting of cardiorespiratory measurements acquired during 992 Treadmill Maximal Exercise Tests (TMET) performed by amateur and professional athletes was considered [15,16].The experiments comprised longitudinal measurements of Graded Exercise Tests (GETs) as well as subject-related metadata, and they were carried out on a Powerjog J Series treadmill to which a CPX MedGraphics gas analyzer and a Mortara 12-lead ECG device were connected.With those, the Heart Rate (HR) of the subjects was recorded, along with the oxygen consumption (VO 2 ), carbon dioxide generation (VCO 2 ), and pulmonary ventilation measures (Respiratory Rate, RR; Exhaled Volume, VE).As for the metadata component, the age (M: 27.10, Q1-Q3: 21.10-36.32),sex (5.66:1 male to female ratio), height (M: 175.00,Q1-Q3: 170.00-180.00),and weight (M: 73.00, Q1-Q3: 66.00-80.23) of the subjects were obtained, along with ambient measurements such as air temperature (M: 22.90, Q1-Q3: 20.80-24.40)and humidity (M: 47.00, Q1-Q3: 42.00-54.00).The GETs comprised a warmup period at a 5 km/h pace, and speed increments ranging from 0.5 to 1 km/h were applied until the consumed oxygen volume was saturated.The subjects recovered the effort at the initial velocity of 5 km/h.The research, published by the Exercise Physiology and Human Performance Lab of the University of Malaga, complied with all the relevant national regulations, and was performed following the tenets of the Helsinki Declaration.The study protocols were approved by the Research Ethics Committee of the University of Malaga, and written informed consent was collected from all the participants or their legal representatives.
Second, a hypotension subset was derived from the MIMIC-III v1.4 (Medical Information Mart for Intensive Care) database [17,18].The database consists of deidentified health-related data from patients that stayed in the critical care unit of the Beth Israel Deaconess Medical Center, in Boston, between 2001 and 2012.Two different critical care information systems were utilized to record patient information: the Philips CareVue Clinical Information System (Philips Healthcare, Andover, MA), and the iMDsoft Metavision ICU (iMDsoft, Needham, MA).Using those, diastolic and systolic pressures, fraction of inspired oxygen (FiO 2 ), urine outputs, and administered vasopressors were monitored.Moreover, Glasgow Coma Scale (GCS) scores were recorded, and the longitudinal data was merged with consequent gender and ethnicity metadata.The hypotension subset was obtained by selecting adult (> 18 years) patients having a Mean Arterial Pressure (MAP) lower than 65 mmHg [19].All the data is in line with the data protection requirements in the HIPAA, and the project was approved by the Institutional Review Boards of the Beth Israel Deaconess Medical Center (Boston, MA), and the Massachusetts Institute of Technology (Cambridge, MA).
Third, a lifelogging dataset was incorporated to the analysis, named PMData [20].FitBit Versa 2 smartwatch wristbands were carried by 16 (13 men and 3 women) people for five months for data collection.The information was enriched with PMSys sports logging application data, as well as a set of responses to a questionnaire that aimed to collect static information (age, gender, height, and whether the subject has a type A or a type B personality).On the other hand, time series that were acquired during the study ranged from weekly recordings to everyfive-second measurements.All participants on this study signed an informed consent form, and the research was carried out under the evaluation of the Norwegian Centre for Research Data (NSD) and in accordance with Norwegian and EU data protection laws.

Data preprocessing
First, the TMET dataset was preprocessed following the steps defined in Larrea et al. [12].Starting from a 992-subject dataset, 30 of them were directly excluded due to missing values on subject metadata.Regarding the longitudinal component, experiments with more than 30 lacking data points were excluded, while the others were linearly interpolated to avoid missing values.Since 942 samples were found missing for the HR variable and 4,871 samples were lacking for both VO 2 and VCO 2 variables, 12 more experiments were excluded from the research.The final TMET dataset consisted of 950 experiments, resulting in 216,600 time stamps overall for a constant sequence length of 228 samples.
Second, the hypotension subset derived from MIMIC-III was generated by directly querying the Google Cloud Platform (GCP) where it is made available.Subjects with missing values in either the systolic or the diastolic pressure measurements were excluded from the final dataset.Assuming missing values on the remaining measurements would entail no alterations on the subjects' previous state, the time series were imputed using a technique called forward filling.This technique uses the last valid measurement of each variable to determine the values of the missing cells until a new valid measurement is found.The categorical variables on this dataset were label-encoded for the DGAN, as it is a model-specific requirement.One-hot encoding was considered better for the WGAN-GP not to cause misinterpretations to the model.The gender variable was binarized, while the ethnicities were clustered into five classes.Finally, 674 subjects were included in the dataset, each of them consisting of 48 hourly measurements and resulting in a total of 32,352 time stamps.
Last, the PMData-based dataset for the experiment was built by selecting the heart rate variable (a measurement was acquired every five seconds) from the longitudinal information and joining it to the associated metadata (age, height, gender, and personality), starting the preprocessing step with a dataset size of 20,991,392 samples and seven variables.Considering the subjects contained variable length heart rate series, the length of the shortest series was chosen to truncate the remaining ones.The remaining dataset consisted of 10,155,936 samples with a time series length of 634,746 samples per subject.Finally, the generation process of the synthetic time series had to be limited to a length of 50,000 samples due to computational resource limitations.

Synthetic time series generation approaches
In this study, three different approaches to generate synthetic time series are presented and compared, namely synthetic metadata generation and real time series coupling or A1, separate synthetic metadata and time series generation and later coupling or A2, and joint synthetic metadata and time series generation, A3.The first two by Larrea et al. [12] were earlier introduced in the manuscript, as well as the third approach considered by Isasa et al. in [13].In this section, more detailed explanations about the different approaches can be found supported by Fig. 1, which graphically represents the workflow each method follows to generate SD.
The first approach (A1) assumes that no private information is disclosed within real time series data.This approach was inspired by Schiff et al. [21], which is based on enriching synthetic subject metadata with real time series.To do that, meaningful statistical measurements (maximum, minimum, and mean values of each time series) are calculated from the real measurement sequence to be treated as metadata.Next, a synthetic counterpart of the metadata is generated, consisting of synthetic subject metadata and synthetic time series statistical measurements.Finally, the synthetic metadata is coherently coupled to the real time series by pairing the generated values with the real ones so that the distance between them is as small as possible.
As for the second approach (A2), the same meaningful statistical measurements are calculated from the set of real time series.However, in this case, a synthetic set of time series is separately generated.Then, the same coherent coupling is conducted between the synthetic metadata and synthetic time series that were separately generated.Lastly, the third approach (A3) assumes that subject metadata positively impacts the quality of the generated time series.In this case, subject metadata and time series are generated in a single step.

Synthetic data generation models
The approaches presented in the previous section were tested using two different SDG models that can incorporate metadata into the time series generation task.Hyperparameter values were experimentally selected for both models and remained unchanged among approximation and dataset combinations.
The Wasserstein GAN with Gradient Penalty (WGAN-GP) was first used with the additional alignment loss feature proposed in the Health Gym project [22].The alignment loss shapes the generator loss function ( L G ) by calculating the L1 norm of the disparity between intervariable correlation coefficients in the real and synthetic datasets, as defined below: where is the alignment coefficient, r i,j is the correlation coefficient between two synthetically generated variables and r i,j is its real counterpart.With this loss function, the generator network gets more penalized if the generated data is not appropriately correlated.
The WGAN-GP was trained for 5,000 epochs using a batch size of 64 samples for the TMET dataset, while for the hypotension subset, a batch size of 32 samples was selected.For the PMData dataset, a batch size of 2 was used due to the low number of subjects.The critic network was updated 5 times more than the generator network at a constant learning rate of 0.001 for all the datasets.Also, the alignment coefficient remained constant at a value of 10.0.
The second generative model used in this research is the DöppelGANger or DGAN [23].The DGAN first generates the metadata to posteriorly generate time series conditionally on top of it.A Multi-Layer Perceptron (MLP) generates the metadata component, and the output is incorporated into a Recurrent Neural Network (RNN) at each time step, whose task is to generate the longitudinal data.In the DGAN, batch generation is used to generate S records per each RNN iteration, a process that aims to capture longer-term effects from the time series.Additionally, a complex discriminator architecture is incorporated into the model, by adding an auxiliary discriminator to focus solely on the metadata.A combined discriminator is responsible for ultimately classifying samples as real or fake.The batch sizes that were used to train the DGAN models were equal to the ones used for the WGAN-GP models.An S value of 12 was considered for the TMET dataset, a value of 6 for the hypotension subset, and a value of 127 for the PMData dataset.For these models, a constant learning rate of 0.001 was also utilized for a training process of 5,000 epochs, while the gradient penalty coefficient was set to 10.0 for the DGAN models.

Synthetic data evaluation
The SD generated using different approaches was evaluated against diverse metrics, quantifying its resemblance to the original data (1), its utility for building downstream applications (2), and the level of privacy of the synthetically generated data (3).
Since generative models produce different data assets each time they are used, a pseudo-cross-validation methodology was incorporated into the evaluation pipeline to eliminate the potential bias due to this aspect.Eight different equally dimensioned synthetic datasets per model and used dataset were generated and used as pseudocross-validation folds for this methodology.The results for each evaluation metric and method were combined by calculating the mean value of the different folds.The evaluation methods used in this analysis were mostly based on Hernandez et al. [24], who proposed a collection of objective and universal metrics for evaluating and comparing SDG models and approaches.
The resemblance metrics calculated in this work aim to assess how similar synthetic datasets are with respect to their real counterparts, considering statistical, distributional, and correlational characteristics between them.For this purpose, a multivariate analysis was performed first to check for inter-variable correlations.Numerical variable pairs were evaluated using the Pearson correlation coefficient, while categorical variable pairs were examined with the Cramér's V value that was calculated from previous χ 2 -tests.Mixed-type correlations were assessed using the R 2 coefficient of a linear regression of the numeric feature over the categorical one.A final correlation matrix was generated from the SD using the mean value of every fold on the pseudo-cross-validation process.The final matrix was compared to the correlations corresponding to the real dataset using the cosine similarity metric, which is defined as 1 − d cos , where a higher value denotes a better correlation preservation.
where both x i and x ′ i refer to the samples to be compared.Another method to evaluate how SD resembles real data is the Data Labelling Analysis (DLA).This process consists of training several classification models (Random Forests, k-Nearest Neighbors, Decision Trees, Support Vector Machines, and a MLP, in this case) to see if they can distinguish synthetic from real samples, simulating a GAN discriminator.The details on how each classification model was configured can be found in Table 1.
The prediction performance was assessed using the Area Under the Receiver Operating Characteristic curve (AUROC), random predictions implying values around 0.5, and the models being unable to distinguish samples from the different sources.
From the proposal of Sajjadi et al. [25], Precision Recall Distribution (PRD) plots were also generated to easily quantify the quality of the model using a single numeric value pair.In their publication, a notion of precision, α , and recall, β , is presented to compare two distributions, the former approximating the quality of the generated samples, and the latter measuring the overlapped proportion between them.Both α and β can be merged on a sin- gle metric by computing the maximum F-score as shown in the following equation: where γ is a positive real factor that weighs the recall with regard to the precision.For this analysis, a γ value of 8 was empirically selected for the evaluation, weighing the recall higher, although it was not of major importance as a relative analysis was prioritized over an absolute assessment.Moreover, adjustments in the γ value would result in an absolute repositioning of the values, with no alteration in the relative differences between approaches nor models.
Data utility refers to the degree to which the SD effectively mimics the original data for analytical or modelling purposes.Regarding that definition, in this work, a Train on Synthetic Test on Real (TSTR) setup was tested against a Train on Real Test on Real (TRTR) one.In these, various Machine Learning (ML) model twins are trained on RD first and SD afterwards, separately.Each twin is tested against an equal real testing set.A regression task was defined for the ML models in both cases, with the VE and GCS variables as the targets on the TMET dataset and the hypotension subset.In this work, each model's performance was evaluated with the Mean Absolute Error (MAE), and the models trained on SD were compared to the RD counterpart using the cosine similarity metric previously presented.This similarity metric is proportional to the MAE similarity, which gives information about the utility of the synthetic data.
The significance of the observed resemblance and utility metric differences between approaches was assessed using t-tests.For a Significance Level (SL) set at 0.05, a p-value below it implies the null hypothesis rejection, meaning that differences being tested are statistically significant.
Finally, in a SDG context, privacy can be defined as the extent to which the generated data protects the sensitive information by preventing the membership of real individuals being disclosed in the original dataset.In this analysis, the privacy was evaluated using a Membership Inference Attack (MIA) setup.In this attack, an adversary was considered to possess a proportion of the original data used to train the generative models, ranging from 10 to 50%.Also, SD was thought to be publicly available and, therefore, accessible to the adversary, as it is one of the strengths of SD.For every generated synthetic sample, the cosine distance was calculated with respect to every real sample in the original dataset.A distance value of 0.2 was used as a threshold to consider a synthetic sample similar enough to mark it as one coming from the RD used to train the generative models.As a longitudinal data setup, a single match between the two datasets led to the subject being marked as disclosed.

Results
In this section, the results of the resemblance, utility, and privacy analyses are presented.In the interest of clarity and simplicity, the models we refer to in this section are subscripted with letters T, M or P, referring to TMET, MIMIC or PMData, which indicate the dataset used in each case to train the model.Also, the p-values derived from all the t-test calculations that were performed during the analysis can be observed in Tables 2 and 3. Regarding the resemblance, the DGAN T model has resulted in near-perfect precision ( α : 0.975-0.987)and recall ( β : 0.974-0.994)values on the A1 PRD plot, as well as the DGAN M model ( α : 0.983-0.990,β : 0.984-0.994).The DGAN P model matched the recall ( β : 0.953-0.985)and precision metrics achieved by the previous models to a great extent, even if the latter's variability between different iterations was high ( α : 0.643-0.941).The A2 counterparts of the models have shown a higher deviation inside each pseudo-cross-validation fold, resulting in precision values ranging from 0.918 to 0.970 and recall values ranging from 0.785 to 0.893 for the TMET dataset, precision values from 0.906-0.939and recall values from 0.705-0.813for the hypotension subset, and precision values from 0.640-0.877and recall values between 0.472-0.794for the PMData dataset.The A3 approach for the DGAN noted further differences between datasets, resulting in a better performance with the TMET dataset ( α : 0.984-0.987,β : 0.933-0.950)with respect to the hypotension subset ( α : 0.966-0.976,β : 0.720-0.763),and even more with the PMData one ( α : 0.160-0.297,β : 0.425-0.704).

Table 1 Classification model parameters
Regarding the WGAN-GP T models, recall values from 0.974 to 0.987 and precision values around 0.987 were achieved after the pseudo-cross-validation evaluation of the A1 approach.For A2, the β value ranged from 0.903 to 0.934, and α fluctuated between 0.491 and 0.530, while for A3, the former was between 0.979 and 0.986, and the latter was between 0.865 and 0.886.For the WGAN-GP M models, the results on the A1 coupling varied between 0.933 and 0.947 for β and   between 0.918 and 0.951 for α .The A2 method led to worse results for both precision and recall, the former ranging from 0.814 to 0.884 and the latter varying between 0.610 and 0.679.Third, the joint generation of A3 resulted in β values between 0.932 and 0. 0.947 and α values between 0.918 and 0.951.Last, the best results among the WGAN-GP P models were achieved with the A1 approach, varying between an α of 0.483-0.636and a β of 0.791-0.960.A3 was the second best approach for the PMData dataset ( α : 0.041-0.047,β : 0.095-0.209),followed by A2 that did not achieve to cover the real distribution nor the quality of the original data ( α : 0, β : 0.003-0.008).In Fig. 2 the PRD plot with every pseudo-cross-validation α and β value pairs can be observed, and the F 8 -scores combining these values are shown in Table 4, where it can be observed that A1 outperforms the other two approaches.
The multivariate analysis on the DGAN T models has shown that the most realistic correlation matrix was achieved using the A3 approach, as it can be visually appreciated in Fig. 3 and numerically confirmed in Table 4. Regarding the WGAN-GP T model, the best correlation was achieved with the A1 approach, and the worst one was achieved with the A2 approach.Figure 3 shows how the subjects' metadata in A2 is not correlated to the time series set.
As for the DGAN M models, the A1 was the best-performing one, just as it can be observed in Fig. 4 and corroborated in Table 4.The WGAN-GP counterparts on the TMET dataset show worsening intervariable correlations that are proportional to the syntheticity of the data, best performing on A1 and worst performing on A3, just as the WGAN-GP models trained on the hypotension subset.
As for the third multivariate analysis, Fig. 5 shows the correlations that were calculated for the PMData dataset.The A3 approach for the DGAN P was the only model that managed to produce the correlations of the original dataset as the rest of the models missed to capture the ones that involved binary variables.The A1 approach was the best for the WGAN-GP P model, but the similarity metrics in Table 4 show poor results.
The DLA analysis on the TMET dataset proved to be similar for every approach, resulting in values near 0.9 in every experiment.Regarding the hypotension subset, the best results were achieved with the first approach (A1), as the predictions were closer to 0.5.The best-performing WGAN-GP T model on the DLA assessment was obtained with the A3 approach, even if closely followed by the other methods.Moreover, the assessment on the WGAN-GP M models show that the lowest classificability of the samples is achieved when following the A1  approach.There was no difference in the DLA analysis for the PMData dataset as the samples were properly distinguished by the classifiers in every case.
To finish with the resemblance assessment, the autocorrelation analysis was performed considering the timedependent variables in each dataset.As evaluating the A1 approach would not make sense (both the generated and the original data contain the same real time series), these values were set to use them as baselines for the other two methods.The A2 approach was the method that best emulated the autocorrelation metrics in A1 for the hypotension subset, the PMData dataset, and the WGAN-GP T model, while A3 was the best-performing one for DGAN T .
The utility metrics derived from the DGAN T models show the most useful data was generated using the A3 approach, while A1 was the worst-performing one, even if the statistical significance was not proven as it can be checked in Table 3.On the other hand, DGAN M models demonstrate the opposite for the hypotension subset, the first approach being the best-performing one (p > 0.05).Concerning the WGAN-GP models, the ones trained with the TMET dataset show the best approach is also A3, even if not far from the other two methods.
The WGAN-GP M combination performed the best using the A2 approach.No statistical test associated with these model and dataset combinations resulted in statistical significance.Regarding the PMData dataset, the A1 approach performed best in the DGAN case (p < 0.05), as well as for the WGAN-GP P (p < 0.05).In Table 5 the values obtained for the TSTR evaluation can be compared.
The last results to be presented are those of the privacy assessment.Regarding the WGAN-GP models, both datasets yielded similar values among approaches: the WGAN-GP T resulting in precision values around 0.7 with a slightly better outcome for A3 in the highest proportions known by the attacker, and the WGAN-GP M not being able to counter the MIA in any scenario.For the WGAN-GP P models, all the approaches resulted in private data as a zero precision was achieved during the MIA simulations.With respect to DGAN T , the best results were achieved with A2 and for low proportions of the dataset known by the adversary (0.1-0.3).For higher proportions (0.4-0.5), comparable results were obtained with every tested approach.The DGAN M model in A1 and A2 outperformed the A3 approach for the lowest proportions (0.1-0.3) with zero precision on the MIA setup, even if for the highest proportions the precision Fig. 4 Multivariate correlations of the real (left) and synthetic hypotension datasets.The upper row shows the data generated with the DGAN, while the lower row shows the data generated with the WGAN-GP.The three columns on the right refer to the different approaches followed during the synthesis.GCS: Glasgow Comma Scale.Note that blank values appear for features with a standard deviation equal to zero, which leads to a null value in correlation calculations deteriorated significantly.Last, the DGAN models trained with the PMData dataset resulted in private data again as the precision was zero for all the approaches.The MIA simulation results are graphically represented in Fig. 6.

Resemblance assessment
The resemblance results presented on the evaluation section were found to follow a pattern dependent on the initial characteristics of the dataset to be synthesized.Specifically, the time-series-to-metadata ratio, which refers to the number of longitudinal variables with respect to the static ones in a dataset, was found to predetermine resemblance metrics, as well as the length of the time series, as these are determinant on the data quantity that remains real in the A1 approach.In fact, the time-series-to-metadata ratio in the TMET dataset was 6:5, from which it can be inferred that 54.5% of the dataset variables were time series, while for the hypotension subset a ratio of 6:2 constituted the 75% of the dataset variables being longitudinal, and for the PMData dataset the ratio was 1:4.Considering this fact, A1 seems to be favoured when higher ratios and/or longer series are used, since a higher percentage of the generated SD remains real, achieving improved resemblance metrics.In principle, approaches A2 and A3 should not depend on this ratio as both metadata and time series are synthesized in these approaches.
Regarding statistical analysis of results on the resemblance metrics, most of correlation similarity checks, autocorrelation measurements and F-score metrics resulted in statistically significant differences.DLA was demonstrated to be statistically significant just for the MIMIC dataset, while no difference was found for TMET and PMData ones.

Utility assessment
Regarding the utility aspect of the generated data, the approach A3 was the best performing one for the TMET data, while approach A1 was better for PMData and DGAN M .This could lead to a similar rationale as discussed in the resemblance section, where a larger quantity of real data could potentially enhance the outcomes in the resultant datasets.
As for the significance of the performed comparisons, neither the models trained with the TMET dataset, nor the DGAN M model were found to be statistically Fig. 5 Multivariate correlations of the real (left) and synthetic PMData datasets.The upper row shows the data generated with the DGAN, while the lower row shows the data generated with the WGAN-GP.The three columns on the right refer to the different approaches followed during the synthesis.HR: Heart Rate.Note that blank values appear for the features with a standard deviation equal to zero, which lead to a null value in correlation calculations significant.In contrast, both the DGAN and the WGAN-GP trained with the PMData dataset and the WGAN-GP M model were found to be different to a great extent.

Privacy assessment
The MIA setup used for evaluation in this study is stringent, but no standardized time-series-specific MIA setup was found in the literature either.Future work in the area of privacy assessment will comprise research and incorporation of different attacks and thresholds to create potential scenarios that encompass a broader range of possibilities.
The WGAN-GP model was found to be susceptible to mode-collapse [26], which might explain the precision achieved by the adversary when attacking the WGAN-GP M generated data.Finding no significant differences in the WGAN-GP T suggests no approach is better than the other in terms of privacy preservation.
Attacks to the data generated with DGAN T suggest the most privacy-preserving approach is A2, which is consistent with the utility metrics that were obtained with it (best utility is achieved with A3).Contrary to our initial hypothesis, the privacy assessment on DGAN M resulted in approaches A1 and A2 outperforming A3 when using the DGAN model, which might relate to how the MIA parameters were configured.
Regarding the MIA simulations performed with the PMData dataset, the low number of subjects in the original dataset may have conditioned the results.A lower number of subjects decreases the possibility to find a potential match between records as fewer calculations are performed.

Contributions and limitations
The main contribution of this work is the detailed definition and systematic comparison of the three methods for generating time series together with metadata in terms of different evaluation dimensions, presenting them in a common framework and substantially extending analyses previously published by the authors in [12,13].The first limitation of the study is related to the fact that no standardized metrics can be found in the literature to evaluate longitudinal synthetic datasets.Even if utility and resemblance metrics can be adapted easily to this use case, it was difficult to objectively evaluate the privacy of the data considering the different thresholds and parameters that were introduced during the assessment, as well as the differences between the datasets that were used.Also, even if the longitudinal variables were jointly evaluated with the metadata, autocorrelation was the only time-series-specific metric that was calculated.For the future, a deeper research on this aspect is planned by analysing time series features such as central tendency (e.g.mean, mode and median), variability metrics (e.g.range and variance), trend metrics (e.g.slope), or seasonality metrics (e.g.seasonal index and cycle).
Moreover, the generated SD did not overcome any clinical validation and, therefore, the utility of the data for real scenarios or use cases was evaluated just by simulating possible downstream applications with the data as it is.Conclusions drawn from analyses derived from the SD should be checked and discussed by clinical experts to provide enough evidence for using SD in real scenarios or use cases.

Conclusion
In this research, a comparative assessment of three different STSG methods was conducted, evaluating three key dimensions of SD: resemblance (1), utility (2), and privacy (3).Although it may not be entirely decisive for setting a gold standard, the research was useful to state that the A2 approach performed worst overall, and therefore, approaches A1 and A3 should be prioritized for any use case.
Depending on the dataset characteristics, the use case, and the risk to be assumed by an end-user, the findings of this analyses point the user in the direction of implementing either A1 or A3 with regard to resemblance and utility metrics.This recommendation should be analyzed more cautiously before generalazing it, as further metric calculations may prompt a shift in direction.The privacy assessment is warranted as a main future work line, including the evaluation of datasets with different characteristics (e.g.low subject quantity), since it is pivotal for SD sharing or publication.

a
Significant results for a SL = 0.05 are underlined and bold

Fig. 2 Fig. 3
Fig. 2 Precision and recall value pairs for each generated dataset.There are eight markers per model and dataset combination.The recall metric is equivalent to the overlapped area between the real and synthetic distributions, while the precision refers to the sample quality

Fig. 6
Fig.6Membership Inference Attack (MIA) precision results.Each subplot represents the approach that was followed to synthesize the data, and each curve represents the pseudo-cross-validated mean precision of a MIA.AU: Arbitrary Unit

Table 2 T
-test p-values perfomed between approach pairs for the resemblance metrics a Significant results for a SL = 0.05 are underlined and bold

Table 3 T
-test p-values perfomed between approach pairs for the utility metrics

Table 4
Resemblance metric resultsValues in bold highlight the best-performing approach per trained model and metric DLA Data Labelling Analysis, AUROC Area Under the Receiver Operating Characteristic Curve, MAE Mean Absolute Error, DGAN DöppgelGANger, WGAN-GP Wasserstein Generative Adversarial Network with Gradient Penalty a The subscripts on the model column refer to the dataset used for the training step

Table 5
Utility metric resultsValues in bold highlight the best-performing approach per trained model and metric TSTR Train on Synthetic and Test on Real a The subscripts on the model column refer to the dataset used for the training step